A BIAS BOUND FOR LEAST SQUARES
LINEAR REGRESSION


Naihua  Duan  and  Ker-Chau  Li 
RAND  Corporation  and  University  of  California 


Abstract: Consider a general linear model $y = g(\alpha + \beta x) + \varepsilon$, where the link
function $g$ is arbitrary and unknown. The maximal component of $(\alpha, \beta)$ that can
be identified is the direction of $\beta$, which measures the substitutability of the
components of $x$. If $\zeta(\beta x) = E(x \mid \beta x)$ is linear in $\beta x$, the least squares
linear regression of $y$ on $x$ gives a consistent estimate for the direction of
$\beta$, despite possible nonlinearity in the link function (Brillinger (1977, 1982)).
If $\zeta(\beta x)$ is nonlinear, the linear regression might be inconsistent for the
direction of $\beta$. We establish a bound for the asymptotic bias, which is
determined from the nonlinearity in $\zeta(\beta x)$ and the multiple correlation
coefficient $R^2$ for the least squares linear regression of $y$ on $x$. According
to the bias bound, the linear regression is nearly consistent for the direction
of $\beta$, despite possible nonlinearity in the link function, provided that the
nonlinearity in $\zeta(\beta x)$ is small compared to $R^2$. Our measure of nonlinearity
in $\zeta(\beta x)$ is analogous to the maximal curvature studied by Cox and Small
(1978). The bias bound is tight; we give the construction for the least
favorable models which achieve the bias bound. The theory is applied to a
special case for an illustration.

Key  words  and  phrases:  Lack  of  fit,  link  function,  maximal  curvature,  nonlinearity, 
projection  index,  projection  pursuit. 


1.  Main  Result 

Least  squares  linear  regression  is  one  of  the  most  widely  used  statistical 
tools. It is based on the standard linear model:

$$ y = \alpha + \beta x + \varepsilon, \qquad \varepsilon \mid x \sim N(0, \sigma^2), \qquad (1) $$

where $y$ denotes a scalar outcome variable, and $x$ denotes a $p$-dimensional column
vector  of  regressor  variables.  In  empirical  applications,  it  is  unlikely  for  the 
standard  linear  model  to  hold  exactly.  Therefore  we  need  to  be  concerned  about 
possible  violations  of  the  model  assumptions.  For  example,  we  might  consider 
distribution  violation:  the  error  distribution  might  not  be  normal.  There  is  a 
rich  literature  on  robust  methods  for  estimating  the  linear  model  in  the  presence 
of  distribution  violation;  see,  e.g.,  Huber  (1981). 

Another serious challenge to the standard linear model is the violation of the
functional form. For example, the true model might be a power transformation
model; the working model (1) might be based on a wrong transformation. More
specifically, we assume the true model has the following form:

$$ y = g(\alpha + \beta x) + \varepsilon, \qquad E(\varepsilon \mid x) = 0, \qquad \beta \neq 0, \qquad (2) $$

where $g$ is the link function, assumed to be arbitrary and unknown. Following
Brillinger (1977, 1982), we call a model of form (2) a general linear model (GLM).
To avoid trivialities, we assume in (2) that $\beta$ is not null.

When the link function is arbitrary and unknown, any linear transformation
of $\alpha + \beta x$ can be absorbed into the link function. Therefore we cannot identify
the intercept $\alpha$, nor can we identify the length or the orientation of $\beta$. The
most that we can identify is the direction of $\beta$, i.e., the collection of the
ratios $\{\beta_j/\beta_k,\ j, k = 1, \ldots, p\}$. The direction of $\beta$ measures the
substitutability of the components of $x$, and these ratios might be the main
quantities of interest in many applications.

We will focus on estimating the direction of $\beta$, using the least squares linear
regression of $y$ on $x$. We are concerned with whether the linear regression still
provides a valid estimate for the direction of $\beta$ when the link function is
nonlinear. We assume a mild condition on the GLM and the regressor $x$:

(A.1) The regressor $x$ is sampled randomly from a probability distribution
$Q(x)$; the following moments exist: $\mu = E(x)$, $\Sigma = \mathrm{Cov}(x)$, $\Sigma^{-1}$, $E(yx')$,
$\sigma^2(x) = \mathrm{Var}(\varepsilon \mid x)$, and $E[\sigma^2(x)xx']$.

Under (A.1), the least squares estimate $\hat\beta_{LS}$ converges to

$$ \beta_{LS} = \mathrm{Cov}(g(\alpha + \beta x), x')\,\Sigma^{-1}. \qquad (3) $$

When the link function $g$ is nonlinear, $\beta_{LS}$ might not have the same direction as
$\beta$. Brillinger (1977, 1982) established that $\beta_{LS}$ does have the same direction as
$\beta$, despite possible nonlinearity in the link function, provided that $x$ is normally
distributed. The result also holds under the weaker condition

(A.2) $\zeta(\beta x) = E(x \mid \beta x)$ is linear in $\beta x$.

Theorem 1. Assume that the random vector $(y, x')$ follows a GLM (2), and
satisfies (A.1) and (A.2). We then have $\beta_{LS} \propto \beta$.
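
As an informal numerical illustration of Theorem 1, the following sketch simulates a GLM with normally distributed regressors, for which (A.2) holds, and checks that the least squares slopes are roughly proportional to $\beta$. The link function tanh, the value of $\beta$, the noise level, and the sample size are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = np.array([1.0, 2.0, -1.0])        # hypothetical true direction

x = rng.standard_normal((n, 3))          # normal regressors, so (A.2) holds
y = np.tanh(x @ beta) + 0.5 * rng.standard_normal(n)   # nonlinear link g = tanh

# Ordinary least squares of y on x, with an intercept.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b_ls = coef[1:]

# If b_ls is proportional to beta, these ratios are roughly constant in j.
ratios = b_ls / beta
print(ratios)
```

The common value of the ratios is not one: the slope vector is shrunk relative to $\beta$, which is consistent with the remark that only the direction of $\beta$ is identified.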

For empirical applications, it is unlikely for condition (A.2) to hold exactly.
When (A.2) fails, $\beta_{LS}$ might not have the same direction as $\beta$. However, we
would expect the noncollinearity to be minor if $\zeta(\beta x)$ is nearly linear in $\beta x$. We
will quantify the magnitude of the noncollinearity between $\beta_{LS}$ and $\beta$, using a
measure of nonlinearity in $\zeta(\beta x)$ which is analogous to Cox and Small's (1978)
maximal curvature.

We measure the noncollinearity between $\beta$ and $\beta_{LS}$ by the squared sine
between the two vectors,

$$ \sin^2(\beta, \beta_{LS}) = 1 - \frac{(\beta_{LS}\Sigma\beta')^2}{(\beta\Sigma\beta')(\beta_{LS}\Sigma\beta_{LS}')}, $$

where the sine function is taken with respect to the inner product

$$ (v, w) = v\Sigma w'. \qquad (4) $$

Note that $1 - \sin^2(\beta, \beta_{LS})$ is the squared correlation between $\beta x$ and $\beta_{LS}x$. If $\beta$
and $\beta_{LS}$ are nearly collinear, the linear regression of $y$ on $x$ does provide useful
information on the approximate direction of $\beta$. If the noncollinearity is severe,
the linear regression might be misleading and should be applied with caution.
Our main result is the following bias bound, which is proved in Section 2.
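
The squared sine under the inner product (4) can be computed directly from its definition. The sketch below is a minimal illustration; the function name `sin2` and the matrix `Sigma` are hypothetical choices made for this example.

```python
import numpy as np

def sin2(beta, b, Sigma):
    """Squared sine between beta and b under the inner product (v, w) = v Sigma w'."""
    beta = np.asarray(beta, dtype=float)
    b = np.asarray(b, dtype=float)
    cross = beta @ Sigma @ b
    return 1.0 - cross**2 / ((beta @ Sigma @ beta) * (b @ Sigma @ b))

Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                  # hypothetical Cov(x)

collinear = sin2([1.0, 0.0], [3.0, 0.0], Sigma)     # same direction
orthogonal = sin2([1.0, 0.0], [-0.5, 1.0], Sigma)   # (1, 0) Sigma (-0.5, 1)' = 0
print(collinear, orthogonal)
```

The second pair is orthogonal with respect to $\Sigma$ but not in the Euclidean sense, which shows why the choice of inner product matters when $\Sigma$ is not a multiple of the identity.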

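Theorem 2. Assume the random vector $(y, x')$ follows a GLM (2), and satisfies
(A.1). The noncollinearity between $\beta$ and $\beta_{LS}$ (3) satisfies the following bias
bound:

$$ \sin^2(\beta, \beta_{LS}) \le \min\left\{1,\ \frac{(1 - R^2)\,\nu_2(\beta)}{R^2\,(1 - \nu_2(\beta))}\right\}, \qquad (5) $$

where $R^2 = \mathrm{Var}(\beta_{LS}x)/\mathrm{Var}(y)$ is the usual $R^2$ for the least squares linear
regression of $y$ on $x$, and $\nu_2(\beta)$ is given by

$$ \nu_2(\beta) = \max_{b \in R^p} \frac{bTb'}{b\Sigma b'} \quad \text{subject to the constraint } b\Sigma\beta' = 0, \quad \text{where } T = \mathrm{Cov}(\zeta(\beta x)). $$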

The scalar $\nu_2(\beta)$ is the second eigenvalue for the spectral decomposition of
$T$ with respect to $\Sigma$. (The first eigenvector is $\beta$; the first eigenvalue is one.) We
now interpret $\nu_2(\beta)$ as a measure for the nonlinearity in $\zeta(\beta x)$. More specifically,
$\nu_2(\beta)$ measures the deviation of $\zeta(\beta x)$ from the linear regression function

$$ \ell(\beta x) = \mu + \Sigma\beta'\beta(x - \mu)/(\beta\Sigma\beta'), $$

where we take the linear regression of $x$ on $\beta x$. Note that $\zeta(\beta x)$ is linear if and
only if the regression of $bx$ on $\beta x$ is linear for all $b \in R^p$. For any $b \in R^p$,
consider the decomposition

$$ bx = b\ell + (b\zeta - b\ell) + (bx - b\zeta), $$
where the first term is the linear regression of $bx$ on $\beta x$, the second term is
the lack of fit for this linear regression, and the third term is the pure error.
We measure the nonlinearity in $b\zeta(\beta x)$ by the proportion of the variance in $bx$
accounted for by the lack of fit term above:

$$ LF(\beta, b) = \mathrm{Var}(b\zeta - b\ell)/\mathrm{Var}(bx). $$

We then maximize $LF(\beta, b)$ over $b$ to measure the nonlinearity in $\zeta(\beta x)$. Since
$\beta x = \beta\zeta = \beta\ell$, it is easy to verify that this nonlinearity measure coincides with

$$ \nu_2(\beta) = \max_{b \in R^p} LF(\beta, b). $$

The nonlinearity measure $\nu_2(\beta)$ is analogous to the maximum curvature
studied in Cox and Small (1978). Cox and Small considered the quadratic
regression of $bx$ on $\beta x$, and measured the nonlinearity by the proportion of the
variance in $bx$ accounted for by the quadratic term. They then maximized the
nonlinearity over $b$. This is analogous to our consideration of $LF$ and its
maximization over $b$.

Theorem 1 follows immediately from Theorem 2: if $\zeta(\beta x)$ is linear in $\beta x$, we
have $\nu_2(\beta) = 0$, therefore the right hand side of (5) is zero, and $\beta_{LS}$ is collinear
with $\beta$. When $\zeta(\beta x)$ is nonlinear, Theorem 1 might not hold; however, the
noncollinearity between $\beta_{LS}$ and $\beta$ is small if the nonlinearity in $\zeta(\beta x)$, $\nu_2(\beta)$, is
small compared to $R^2$.

The bias bound (5) is tight in the following sense: for a given $\beta$, a given $Q(x)$,
and a given $R^2$, we can find a GLM for which the noncollinearity $\sin^2(\beta, \beta_{LS})$
equals the right hand side of (5). The construction of such a GLM is given in
Section 2.

For empirical applications, we need to estimate the right hand side of (5).
If we have a good initial estimate for the direction of $\beta$, we can estimate $\zeta(\beta x)$
by an appropriate nonparametric regression of $x$ on the estimated index, then
estimate $T$, $\nu_2(\beta)$, and the bias bound. If we don't have a good initial estimate
for the direction of $\beta$, we can take a conservative approach and replace $\nu_2(\beta)$ in
(5) by its maximum

$$ \max_{\beta \in R^p} \nu_2(\beta). $$

For each $\beta \in R^p$, we estimate $\zeta(\beta x)$, then estimate $T$ and $\nu_2(\beta)$. We then
search for the maximum of the estimated $\nu_2(\beta)$'s. This is a projection pursuit
problem, with $\nu_2(\beta)$ as the projection index. Huber (1985, and discussions) gave
a comprehensive review of the projection pursuit problem. Cox (1985) suggested
using the maximum curvature in Cox and Small (1978) as the projection index.
This is analogous to using $\nu_2(\beta)$ as the projection index.
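
The plug-in procedure described above can be sketched as follows. This is only one possible realization: the nonparametric regression is taken to be a simple equal-frequency binned mean, and the function name `estimate_nu2`, the number of bins, and the simulated data are all assumptions made for this illustration rather than choices made in the paper.

```python
import numpy as np

def estimate_nu2(x, beta, n_bins=50):
    """Plug-in estimate of nu_2(beta) from a sample x with n rows and p columns."""
    n, p = x.shape
    u = x @ beta
    # Equal-frequency bins of the index u; bin means of x estimate zeta(u).
    edges = np.quantile(u, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(u, edges)
    counts = np.bincount(bins, minlength=n_bins)
    means = np.stack(
        [np.bincount(bins, weights=x[:, j], minlength=n_bins) for j in range(p)],
        axis=1,
    ) / counts[:, None]
    zeta_hat = means[bins]                      # fitted zeta for every observation
    T_hat = np.cov(zeta_hat, rowvar=False)
    Sigma_hat = np.cov(x, rowvar=False)
    # Eigenvalues of T with respect to Sigma, via Cholesky whitening.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma_hat))
    eig = np.sort(np.linalg.eigvalsh(L_inv @ T_hat @ L_inv.T))
    return eig[-2]                              # the largest is close to one (direction beta)

# Simulated data: x uniform on the square of Section 3, beta = (1, 1/3).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(200_000, 2))
nu2_hat = estimate_nu2(x, np.array([1.0, 1 / 3]))
print(nu2_hat)    # Section 3 gives nu_2 = 2/27 for this case
```

Binning introduces a small smoothing bias, so the estimate should be read as approximate; a smoother regression method could be substituted in the same place.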

We prove Theorem 2 in Section 2, then apply the theory to a special case
in Section 3 for an illustration.


Remark 1. It is helpful to interpret the bias bound (5) in terms of the
equivalent magnitude of estimation error. Under the standard linear model (1), the
least squares estimate $\hat\beta_{LS}$ is unbiased for $\beta$. We measure the magnitude of its
estimation error by the mean squared sine,

$$ MSS = E[\sin^2(\beta, \hat\beta_{LS})], $$

where the sine function is taken with respect to the inner product (4). It is easy
to verify that

$$ MSS \approx n^{-1}(p - 1)(1 - R^2)/R^2, $$

where $n$ is the sample size.

The mean squared sine approximately equals the right hand side of the bias
bound (5) if

$$ n = \frac{(p - 1)(1 - \nu_2(\beta))}{\nu_2(\beta)}. \qquad (6) $$

If the sample size is much smaller than the right hand side of (6), bias is likely to
be negligible compared with estimation error. If the sample size is much larger
than the right hand side of (6), bias might dominate estimation error if the true
link function is substantially nonlinear.

For the same nonlinearity measure $\nu_2(\beta)$, the right hand side of (6) is
proportional to $p - 1$. The asymptotic bias is less serious (compared to estimation
error) for larger $p$'s.
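
The comparison in Remark 1 can be checked with a few lines of arithmetic. The sketch below assumes that the bias bound has the form $(1 - R^2)\nu_2/(R^2(1 - \nu_2))$ and that the crossover sample size is $(p - 1)(1 - \nu_2)/\nu_2$, which agrees with the value 12.5 quoted in Section 3; the function names are hypothetical.

```python
def bias_bound(R2, nu2):
    """Assumed form of the right hand side of the bias bound (5)."""
    return min(1.0, (1 - R2) * nu2 / (R2 * (1 - nu2)))

def approx_mss(n, p, R2):
    """Approximate mean squared sine under the standard linear model."""
    return (p - 1) * (1 - R2) / (n * R2)

def crossover_n(p, nu2):
    """Sample size at which approx_mss equals bias_bound (when the bound is below one)."""
    return (p - 1) * (1 - nu2) / nu2

n_star = crossover_n(p=2, nu2=2 / 27)
print(n_star)                                                # about 12.5
print(approx_mss(n_star, 2, 0.5), bias_bound(0.5, 2 / 27))   # both about 0.08
```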

Remark  2.  The  bias  bound  (5)  and  the  discussion  in  Remark  1  deal  with  the 
worst case bias when the true link function is the least favorable. For a specific
empirical  study,  the  actual  bias  might  be  substantially  smaller  than  the  right 
hand  side  of  (5)  if  the  true  link  function  is  not  the  least  favorable. 

The  results  in  this  paper  can  be  used  as  a  screening  device  to  diagnose 
whether  we  might  have  a  serious  bias  if  the  true  link  function  is  substantially 
nonlinear.  If  the  right  hand  side  of  (5)  is  big,  the  empirical  scientist  should  be 
alerted  to  pay  more  attention  to  the  goodness  of  the  link  function  used  in  his 
working  model.  If  the  right  hand  side  of  (5)  is  small,  say,  compared  to  the  MSS 
discussed  in  Remark  1,  the  goodness  of  the  link  function  is  less  crucial. 


2.  Proof 


The bias bound (5) is equivalent to

$$ \frac{R^2\sin^2(\beta, \beta_{LS})}{1 - R^2\cos^2(\beta, \beta_{LS})} \le \min(R^2, \nu_2(\beta)). $$

It is easy to see that the left hand side is bounded from above by $R^2$. We now
derive the other bound. In the derivation below, we consider several geometric
measures such as norm and orthogonality; all of these are taken with respect to
the inner product (4).

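We decompose $\beta_{LS}$ into

$$ \beta_{LS} = c\beta + \beta^{\perp}, \qquad \beta^{\perp} \perp \beta, $$

where $c$ is a scalar. Since

$$ \sin^2(\beta, \beta_{LS}) = \|\beta^{\perp}\|^2/\|\beta_{LS}\|^2, \qquad R^2 = \|\beta_{LS}\|^2/\mathrm{Var}(y), $$

we have

$$ \frac{R^2\sin^2(\beta, \beta_{LS})}{1 - R^2\cos^2(\beta, \beta_{LS})} = \|\beta^{\perp}\|^2/(\mathrm{Var}(y) - c^2\|\beta\|^2). \qquad (7) $$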


Let $v$ denote the slope vector for the least squares linear regression of
$g(\alpha + \beta x)$ on $\zeta$:

$$ vT = \mathrm{Cov}(g(\alpha + \beta x), \zeta(\beta x)'). $$

If $T$ does not have full rank, $v$ is not uniquely defined. We can take any version
of $v$. Consider the decomposition

$$ v = d\beta + w, \qquad w \perp \beta, \qquad (8) $$

where $d$ is a scalar. We claim that

$$ c = d, \qquad \beta^{\perp} = wT\Sigma^{-1}. \qquad (9) $$


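To see this, note that

$$ vT = \mathrm{Cov}(g, \zeta') = \mathrm{Cov}(g, x') = \beta_{LS}\Sigma, $$

$$ \beta T = \mathrm{Cov}(\beta\zeta, \zeta') = \mathrm{Cov}(\beta x, \zeta') = \mathrm{Cov}(\beta x, x') = \beta\Sigma. $$

It follows that

$$ \beta_{LS}\Sigma = vT = d\beta\Sigma + wT. $$

Moreover, $wT\Sigma^{-1}$ is orthogonal to $\beta$, since $wT\Sigma^{-1}\Sigma\beta' = wT\beta' = w\Sigma\beta' = 0$,
which proves (9).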

Let $\tau^2 = \mathrm{Var}(y - v\zeta)$. It follows from (8) and (9) that

$$ \mathrm{Var}(y) = \mathrm{Var}(v\zeta) + \tau^2 = c^2\|\beta\|^2 + wTw' + \tau^2. \qquad (10) $$

Combining (7), (9) and (10), we have

$$ \frac{R^2\sin^2(\beta, \beta_{LS})}{1 - R^2\cos^2(\beta, \beta_{LS})} = \frac{wT\Sigma^{-1}Tw'}{wTw' + \tau^2} \le \frac{wT\Sigma^{-1}Tw'}{wTw'}. \qquad (11) $$


We want to maximize the right hand side of (11) under the constraint $w \perp \beta$,
thus we consider the spectral decomposition of $T\Sigma^{-1}T$ with respect to $T$. This
is equivalent to the spectral decomposition of $T$ with respect to $\Sigma$. If $w$ is an
eigenvector with eigenvalue $\nu$ for the second spectral decomposition, i.e.,

$$ wT = \nu w\Sigma, \qquad (12) $$

then $w$ is also an eigenvector with the same eigenvalue for the first spectral
decomposition. The first eigenvalue for the spectral decomposition (12) is one,
with the corresponding eigenvector being $w \propto \beta$. All other eigenvectors are
orthogonal to $\beta$. It follows that the right hand side of (11) is maximized by
taking $w$ to be the second eigenvector for the spectral decomposition (12). The
maximum is the second eigenvalue $\nu_2(\beta)$ for (12). This completes the proof for
Theorem 2.

We now establish that the bias bound (5) is tight. For a given $\beta$ and a
given $Q(x)$, the "least favorable" GLM's have no pure error, $\varepsilon = 0$, and have the
following link function:

$$ y = g(\alpha + \beta x) = c\beta x + w^*\zeta(\beta x), \qquad (13) $$

where $w^*$ is the second eigenvector for (12), and $c$ is a scalar to be determined
from the given $R^2$. For those GLM's, we have

$$ R^2 = (c^2\|\beta\|^2 + \nu_2(\beta)^2\|w^*\|^2)/(c^2\|\beta\|^2 + \nu_2(\beta)\|w^*\|^2). $$

If $R^2 > \nu_2(\beta)$, we have

$$ c^2 = (R^2 - \nu_2(\beta))\,\nu_2(\beta)\,\|w^*\|^2/((1 - R^2)\|\beta\|^2), $$

thus there exists a least favorable GLM of form (13) which has the given $R^2$ and
achieves the right hand side of the bias bound (5).
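
The algebra behind the least favorable construction can be verified numerically at the population level. The sketch below borrows the setting of Section 3 ($\Sigma = I/3$, $\beta = (1, 1/3)$, $\nu_2 = 2/27$) and uses the fact that, for a second eigenvector $w^*$ of (12), $\beta_{LS} = c\beta + \nu_2 w^*$ and $\mathrm{Var}(y) = c^2\|\beta\|^2 + \nu_2\|w^*\|^2$. The vector `w`, the chosen value of `R2`, and the variable names are assumptions for this illustration; in two dimensions `w` is, up to scale, the only direction orthogonal to $\beta$.

```python
import numpy as np

Sigma = np.eye(2) / 3.0                  # Cov(x) for x uniform on [-1, 1]^2
beta = np.array([1.0, 1 / 3])
w = np.array([1 / 3, -1.0])              # orthogonal to beta under inner product (4)
nu2, R2 = 2 / 27, 0.8                    # requires R2 > nu2

norm2 = lambda v: v @ Sigma @ v          # squared norm under inner product (4)
assert abs(beta @ Sigma @ w) < 1e-12

# Scalar c determined from the given R2.
c2 = (R2 - nu2) * nu2 * norm2(w) / ((1 - R2) * norm2(beta))
b_ls = np.sqrt(c2) * beta + nu2 * w      # beta_LS for the least favorable GLM (13)
var_y = c2 * norm2(beta) + nu2 * norm2(w)

R2_check = norm2(b_ls) / var_y           # should recover the given R2
sin2 = nu2**2 * norm2(w) / norm2(b_ls)   # squared sine between beta and beta_LS
bound = min(1.0, (1 - R2) * nu2 / (R2 * (1 - nu2)))
print(R2_check, sin2, bound)
```

The recovered $R^2$ matches the target and the squared sine coincides with the right hand side of (5), which is the sense in which the bound is attained.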

If $R^2 < \nu_2(\beta)$, there does not exist a corresponding least favorable GLM of
form (13). The right hand side of (5) can be achieved in this case with the GLM

$$ y = w^*\zeta(\beta x) + \varepsilon, \qquad \varepsilon \mid x \sim N(0, \tau^2), \qquad (13') $$

where

$$ \tau^2 = (\nu_2(\beta) - R^2)\,\nu_2(\beta)\,\|w^*\|^2/R^2. $$

The bias bound (5) is noninformative in this case; the noncollinearity between
$\beta$ and $\beta_{LS}$ is unrestricted. For the GLM (13'), $\beta_{LS}$ is orthogonal to $\beta$.

Remark 3. The bias bound allows the link function to be arbitrary. If we have
prior information on the link function, it might be possible to sharpen the bound.
For example, it might be known that the link function is monotonic. The link
function for the least favorable GLM (13) might not be monotonic. Since

$$ \mathrm{Cov}(w^*\zeta(\beta x), \beta x) = w^*T\beta' = 0, $$

$w^*\zeta(\beta x)$ is not monotonic in $\beta x$. If $c$ is small compared to $\|w^*\|$, i.e., $R^2 - \nu_2(\beta)$
is small, the link function is not monotonic. If we restrict to monotonic link
functions, it might be possible to improve upon the bias bound (5). An example
is given in Section 3.


3.  A  Special  Case 


We illustrate the application of Theorem 2 with a special case which might
be of interest in itself. We assume $x$ is uniformly distributed over the square
$(-1 \le x_1 \le 1,\ -1 \le x_2 \le 1)$. Let $\beta = (1, t)$. Without loss of generality, we
assume $0 \le t \le 1$.

If $t = 0$ or $t = 1$, $\zeta(\beta x)$ is linear, thus $\beta_{LS}$ is collinear with $\beta$. If $0 < t < 1$,
then $\zeta(\beta x) = (\zeta_1(\beta x), \zeta_2(\beta x))$ is nonlinear:


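$$ \zeta_1(\beta x) = \begin{cases} (\beta x - 1 + t)/2, & \text{if } \beta x < -1 + t; \\ \beta x, & \text{if } -1 + t \le \beta x \le 1 - t; \\ (\beta x + 1 - t)/2, & \text{if } \beta x > 1 - t; \end{cases} $$

$$ \zeta_2(\beta x) = \begin{cases} (\beta x + 1 - t)/(2t), & \text{if } \beta x < -1 + t; \\ 0, & \text{if } -1 + t \le \beta x \le 1 - t; \\ (\beta x - 1 + t)/(2t), & \text{if } \beta x > 1 - t; \end{cases} $$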


thus $\beta_{LS}$ and $\beta$ might not be collinear. It is easy to verify that

$$ \nu_2(\beta) = t(1 - t)^2/2 \le 2/27, $$

where the maximum occurs at $t = 1/3$.
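
The closed form for $\nu_2(\beta)$ can be checked numerically from the piecewise expressions for $\zeta_1$ and $\zeta_2$. The sketch below integrates $\zeta\zeta'$ against the density of $\beta x = x_1 + t x_2$ by Gauss-Legendre quadrature on each linear piece and then takes the eigenvalues of $T$ with respect to $\Sigma = I/3$. The trapezoidal density of $\beta x$ and the function names are worked out for this illustration and do not appear in the text above.

```python
import numpy as np

def zeta(u, t):
    """E(x | beta x = u) for x uniform on [-1, 1]^2 and beta = (1, t), 0 < t < 1."""
    lo, hi = u < -1 + t, u > 1 - t
    z1 = np.where(lo, (u - 1 + t) / 2, np.where(hi, (u + 1 - t) / 2, u))
    z2 = np.where(lo, (u + 1 - t) / (2 * t),
                  np.where(hi, (u - 1 + t) / (2 * t), 0.0))
    return np.stack([z1, z2], axis=-1)

def density(u, t):
    """Trapezoidal density of u = x1 + t * x2 (derived for this sketch)."""
    return np.minimum(2.0, (1 + t - np.abs(u)) / t) / 4.0

def spectrum(t, nodes=8):
    """Eigenvalues of T = Cov(zeta) with respect to Sigma = I/3, in ascending order."""
    gx, gw = np.polynomial.legendre.leggauss(nodes)
    breaks = [-1 - t, -1 + t, 1 - t, 1 + t]
    T = np.zeros((2, 2))
    for a, b in zip(breaks[:-1], breaks[1:]):
        u = 0.5 * (b - a) * gx + 0.5 * (a + b)      # quadrature nodes on [a, b]
        wts = 0.5 * (b - a) * gw * density(u, t)
        z = zeta(u, t)
        T += (z * wts[:, None]).T @ z               # E(zeta) = 0 by symmetry
    return np.sort(np.linalg.eigvalsh(3.0 * T))     # Sigma^{-1} T with Sigma = I/3

for t in (0.2, 1 / 3, 0.5):
    nu2, first = spectrum(t)
    print(t, first, nu2, t * (1 - t) ** 2 / 2)
```

On each piece the integrand is a polynomial of degree at most three, so the quadrature is exact up to rounding; the first eigenvalue is one and the second agrees with $t(1 - t)^2/2$.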

We have tabulated the bias bound (5) for $\beta = (1, 1/3)$. The second column
in Table 1 gives the maximal angle between $\beta$ and $\beta_{LS}$ for $R^2$'s ranging from
10% to 90%. The asymptotic bias is substantial when $R^2$ is not close to one.

Table 1. Maximal angle between $\beta$ and $\beta_{LS}$ for $\beta = (1, 1/3)$

                      Maximal angle
$R^2$      Arbitrary $g$ [1]    Monotonic $g$ [2]
0.10            58.1°                12.5°
0.20            34.4°                12.5°
0.30            25.6°                12.5°
0.40            20.3°                12.5°
0.50            16.4°                12.5°
0.60            13.4°                12.5°
17/27           12.5°                12.5°
0.70            10.7°                10.7°
0.80             8.1°                 8.1°
0.90             5.4°                 5.4°

[1] No restrictions on the link function $g$.

[2] Link function $g$ restricted to be monotonic.


For $p = 2$, $\nu_2(\beta) = 2/27$, the right hand side of (6) is 12.5, which is quite
small. For most relevant sample sizes, bias might dominate estimation error if
the true link function is substantially nonlinear.

It can be shown that the least favorable GLM (13) is monotonic in $\beta x$ for
$R^2 \ge 17/27$, and is not monotonic for $R^2 < 17/27$. If we restrict to monotonic
link functions, we can sharpen the bias bound for $R^2 < 17/27$: the maximal
angle is identical to that for $R^2 = 17/27$; see the third column in Table 1. For
small $R^2$'s, the bias bound is sharpened substantially.
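
The entries of Table 1 can be recomputed from the bias bound. The sketch below assumes the bound has the form $(1 - R^2)\nu_2/(R^2(1 - \nu_2))$ with $\nu_2 = 2/27$, and implements the monotonic column by evaluating the bound at $\max(R^2, 17/27)$, following the description in the preceding paragraph; the function name is hypothetical.

```python
import numpy as np

def max_angle_deg(R2, nu2):
    """Largest angle (in degrees) between beta and beta_LS allowed by the bound."""
    sin2 = min(1.0, (1 - R2) * nu2 / (R2 * (1 - nu2)))
    return float(np.degrees(np.arcsin(np.sqrt(sin2))))

nu2 = 2 / 27          # beta = (1, 1/3)
R2_mono = 17 / 27     # threshold for monotonicity quoted in the text

for R2 in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, R2_mono, 0.7, 0.8, 0.9]:
    arbitrary = max_angle_deg(R2, nu2)
    monotonic = max_angle_deg(max(R2, R2_mono), nu2)
    print(f"{R2:.2f}  {arbitrary:5.1f}  {monotonic:5.1f}")
```

The printed angles can be compared against the two columns of Table 1 row by row.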